This is a short investigation into the effects of uncertainty on spatial correlations. It is not comprehensive, and certainly should not be seen as authoritative, but it raises some questions (at least for me). The results seem a bit extreme to me, but I wanted to write them up here so that I have a reference where I can look up the code and results.
In the reviews of Nicolas M. Kuehn and Abrahamson (2020), where we developed spatial correlations for event terms, Peter Stafford pointed out that event terms are associated with uncertainty (which can be quite substantial for events that are not well recorded), and this should not be ignored for the spatial correlation models. At the very least, it affects the scale parameter, as ignoring uncertainty will lead to an underestimation of \(\tau\). While looking into (non-stationary) spatial correlation models for site terms, I noticed that the length scales I got when just using the estimated site terms were shorter than when estimating a full model (i.e. site terms and spatial correlation of site terms at the same time). In nonergodic models based on varying coefficient models (e.g. Landwehr et al. 2016; Lavrentiadis, Abrahamson, and Kuehn 2021; N. Kuehn 2021), we estimate spatial correlations of constants and the model at the same time, so they take uncertainty in event and site terms into account. The question I had was what happens for spatial correlation models of within-event (and within-site) residuals. Since event terms and site terms are uncertain, the residuals after taking them out (i.e. within-event residuals \(\delta W\) and within-event/within-site residuals \(\delta WS\)) should be uncertain as well. Typically, spatial correlation models like Jayaram and Baker (2009) are estimated for within-event residuals of well-recorded events, so the uncertainty of \(\delta W\) should be small, but I'm not sure anyone has ever looked into this. \(\delta WS\) should be more uncertain, since site terms are generally more uncertain (often fewer recordings per site), so such models might be more affected. I think more models are moving towards partitioning into site terms, so one can probably expect spatial correlation models for \(\delta WS\) to be coming out.
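As a quick sanity check of the \(\tau\) point, here is a toy base-R sketch (not part of the analysis below; all names and numbers are made up): when events have few records, the shrunken (BLUP-style) point estimates of the event terms have a smaller spread than the true \(\tau\).

```r
# toy illustration: spread of estimated event terms understates tau
set.seed(1)
tau_toy <- 0.4; phi_toy <- 0.5
n_ev_toy <- 500; n_rec_toy <- 4        # poorly recorded events
dB_toy <- rnorm(n_ev_toy, 0, tau_toy)  # true event terms
y_toy  <- rep(dB_toy, each = n_rec_toy) +
  rnorm(n_ev_toy * n_rec_toy, 0, phi_toy)

# conditional mean of each event term, assuming known variances
ybar_toy <- tapply(y_toy, rep(1:n_ev_toy, each = n_rec_toy), mean)
k_toy <- tau_toy^2 / (tau_toy^2 + phi_toy^2 / n_rec_toy)  # shrinkage factor
dB_hat_toy <- k_toy * ybar_toy

c(sd_estimated_terms = sd(dB_hat_toy), tau_true = tau_toy)
# the sd of the estimated event terms is roughly tau * sqrt(k) < tau
```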
Here, I just simulate some data (spatially correlated observations from events/stations), and then fit models to see whether the true parameters can be estimated, using the true data-generating model as well as a partitioned model. Model estimation is done with Stan (Carpenter et al. 2017), and INLA (Rue, Martino, and Chopin 2009) is used for partitioning the residuals. The spatial correlation Stan model is similar to the one of Stafford et al. (2019); see also Nicolas M. Kuehn and Stafford (2021).
First, we load the packages required for the analysis, and set some things like plotting defaults, cmdstan paths, and so on.
# load required packages
library(cmdstanr)
library(posterior)
library(bayesplot)
library(ggplot2)
library(INLA)
library(gstat)
library(matrixStats)
theme_set(theme_gray() +
theme(panel.grid.minor = element_blank(),
axis.title = element_text(size = 20),
axis.text = element_text(size = 14),
plot.title = element_text(size = 30),
plot.subtitle = element_text(size = 20),
legend.text = element_text(size = 20),
legend.title = element_text(size = 20)
)
)
set_cmdstan_path('/Users/nico/GROUNDMOTION/SOFTWARE/cmdstan-2.29.0')
cmdstan_path()## [1] "/Users/nico/GROUNDMOTION/SOFTWARE/cmdstan-2.29.0"
cmdstan_version()## [1] "2.29.0"
dir_stan <- "/Users/nico/GROUNDMOTION/PROJECTS/NONERGODIC/SCENARIO_MAPS/COREG/STAN_SPATCORR"
LongLatToUTM <- function(x, y, zone){
xy <- data.frame(ID = 1:length(x), X = x, Y = y)
coordinates(xy) <- c("X", "Y")
proj4string(xy) <- CRS("+proj=longlat +datum=WGS84") ## for example
res <- spTransform(xy, CRS(paste("+proj=utm +zone=",zone," ellps=WGS84",sep='')))
return(as.data.frame(res))
}
`%notin%` <- Negate(`%in%`)
The simulation is done for realistic data distributions (number of events, number of stations, number of records). Here, I use the Californian subset of NGA W2 (Ancheta et al. 2014), used by Abrahamson, Silva, and Kamai (2014). The data is read in, then coordinates are transformed to UTM coordinates, and event/station indices are set to integer values. Finally, we make a data frame of the whole data with the important variables.
###### Data
data <- read.csv(file.path("data_calif_pga_t10s.csv"))
long_lat_ev <- cbind(data$Hypocenter.Longitude..deg., data$Hypocenter.Latitude..deg.)
long_lat_stat <- cbind(data$Lon_stat, data$Lat_stat)
#convert from degrees to UTM
utm_zone <- 10
co_eq_utm <- LongLatToUTM(long_lat_ev[,1],long_lat_ev[,2], utm_zone)[,c(2,3)]/1000
co_stat_utm <- LongLatToUTM(long_lat_stat[,1],long_lat_stat[,2], utm_zone)[,c(2,3)]/1000
eqid <- data$eqid
statid <- data$STATID
n_rec <- length(eqid)
n_eq <- length(unique(eqid))
n_stat <- length(unique(statid))
#SET STATION INDICES
stat <- array(0,n_rec)
statid_uni <- unique(statid)
for (i in 1:n_stat){
indi <- ifelse(statid %in% statid_uni[i],i,0)
stat <- stat + indi
}
#SET EVENT INDICES
eq <- array(0,n_rec)
eqid_uni <- unique(eqid)
for (i in 1:n_eq) {
indi <- ifelse(eqid %in% eqid_uni[i],i,0)
eq <- eq + indi
}
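As an aside, the index-building loops above can be reproduced with base R's match(), which maps each id to the position of its first occurrence. A small self-contained check (with made-up ids, not the NGA-W2 ones):

```r
# hypothetical ids; match() computes exactly what the accumulation
# loops above compute, in one line
ids <- c("A10", "B7", "A10", "C2", "B7")
ids_uni <- unique(ids)

# loop version, as above
idx_loop <- numeric(length(ids))
for (i in seq_along(ids_uni)) {
  idx_loop <- idx_loop + ifelse(ids %in% ids_uni[i], i, 0)
}

# one-liner
idx_match <- match(ids, ids_uni)

print(idx_loop)   # 1 2 1 3 2
print(idx_match)  # 1 2 1 3 2
```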
data_reg <- data.frame(eq = eq,
stat = stat,
eqid = eqid,
statid = statid,
Rrup = data$Rrup,
Lon_stat = data$Lon_stat,
Lat_stat = data$Lat_stat,
Lon_ev = data$Hypocenter.Longitude..deg.,
Lat_ev = data$Hypocenter.Latitude..deg.,
X_eq = co_eq_utm[,1],
Y_eq = co_eq_utm[,2],
X_stat = co_stat_utm[,1],
Y_stat = co_stat_utm[,2]
)
Only a subset is used, with a minimum number of 10 records per event, and getting rid of "colocated" stations (I combined stations that are very close together to have the same station id, which causes some problems).
ev_list <- as.numeric(names(which(table(data_reg$eqid) > 10)))
dat_used <- data.frame()
for(i in 1:length(ev_list)) {
tmp <- data_reg[data_reg$eqid == ev_list[i]
& data_reg$Rrup <= 100,]
tmp2 <- names(which(table(tmp$statid) > 1))
tmp <- tmp[tmp$statid %notin% tmp2,]
dat_used <- rbind(dat_used, tmp)
}
#dat_used <- dat_used_cpo
dim(dat_used)## [1] 7712 13
n_rec_used <- nrow(dat_used)
n_eq_used <- length(unique(dat_used$eq))
n_stat_used <- length(unique(dat_used$stat))
#SET EVENT INDICES
eqid <- dat_used$eqid
eq <- array(0,n_rec_used)
eqid_uni <- unique(eqid)
for (i in 1:n_eq_used) {
indi <- ifelse(eqid %in% eqid_uni[i],i,0)
eq <- eq + indi
}
dat_used$eq <- eq
#SET STATION INDICES
statid <- dat_used$statid
stat <- array(0,n_rec_used)
statid_uni <- unique(statid)
for (i in 1:n_stat_used) {
indi <- ifelse(statid %in% statid_uni[i],i,0)
stat <- stat + indi
}
dat_used$stat <- stat
#### find indices of events, need to be sorted by event ids
start <- vector(length = n_eq_used)
end <- vector(length = n_eq_used)
num_eq <- vector(length = n_eq_used)
for(i in 1:n_eq_used) {
start[i] <- min(which(eq == i))
end[i] <- max(which(eq == i))
num_eq[i] <- end[i] - start[i] + 1
}
max_num <- max(num_eq)
ind_na_eq <- array(0, dim = c(n_eq_used, max_num))
len_na_eq <- vector(mode = "numeric", length = n_eq_used)
for(i in 1:n_eq_used) {
#tmp <- which(!is.na(y_target[eq == i])) can be used to filter out NA
tmp <- which(eq == i)
ind_na_eq[i, 1:length(tmp)] <- tmp
len_na_eq[i] <- length(tmp)
}
Here, some data is generated according to the following model \[ \delta B \sim N(0, \tau) \\ \delta S \sim N(0, \phi_{S2S}) \\ \ell \sim LN(\mu_{\ln \ell}, \sigma_{\ln \ell}) \\ Y \sim GP(\delta B + \delta S, k(\vec{x}_s,\vec{x}_s')) \\ k(\vec{x}_s,\vec{x}_s') = \delta_{ij,eq} \left(\theta_1^2 \exp \left[- \frac{|\vec{x}_s - \vec{x}_s'|}{\ell}\right] + \theta_2^2 \delta_{ij,rec}\right) \] So basically, sample some event terms and station terms from normal distributions, and then sample observations from a GP with mean event term plus station term, and an exponential covariance function with some observation noise. The covariance matrix is zero if records are not from the same event (\(\delta_{ij,eq}\) is supposed to be a Kronecker symbol for events, and similarly \(\delta_{ij,rec}\) for records; not the best notation, but should work). We also use different length scales for the events, which are distributed according to a lognormal distribution. We have \(\phi_{SS}^2 = \theta_1^2 + \theta_2^2\), which is modeled as \(\theta_1^2 = \omega \phi_{SS}^2\) and \(\theta_2^2 = (1 - \omega) \phi_{SS}^2\), where \(\omega\) is between 0 and 1.
The reason to have different length scales for each event has less to do with the simulation, and more to do with the fact that I wanted to see whether it is possible to estimate the different length scales and associated hyperparameters. This offers the possibility to do that on real data, and possibly look into spatial correlations of length scales (but I am getting ahead of myself).
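Before going to Stan, the per-event covariance structure of the model above can be sketched in base R (assuming the exponential correlation function; coordinates and numbers here are made up):

```r
set.seed(123)
phi_ss <- 0.5; omega <- 0.75; ell <- 10
theta1_sq <- omega * phi_ss^2        # spatially correlated part
theta2_sq <- (1 - omega) * phi_ss^2  # nugget (record-to-record) part

# station coordinates for one event (km)
X <- cbind(runif(20, 0, 50), runif(20, 0, 50))
h <- as.matrix(dist(X))              # pairwise distances

# k(x, x') = theta1^2 exp(-|x - x'|/ell) + theta2^2 delta_ij
K <- theta1_sq * exp(-h / ell) + theta2_sq * diag(nrow(X))

# on the diagonal the variance is phi_SS^2 = theta1^2 + theta2^2
range(diag(K) - phi_ss^2)

# sample within-event residuals for this event via Cholesky
dWS <- drop(t(chol(K)) %*% rnorm(nrow(X)))
```

The full simulation below does this per event (with event-specific \(\ell\)) and adds the event and station terms to the mean.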
Below we define the hyperparameters for simulation.
sim_name <- "sim-l"
phi_ss_sim <- 0.5
tau_sim <- 0.4
phi_s2s_sim <- 0.4
mu_log_ell <- 2.3 # approx log(10)
sigma_log_ell <- 0.5
n_wp <- 500
n_sample <- 200
data_list_sim <- list(N = n_rec_used,
NEQ = n_eq_used,
NSTAT = n_stat_used,
M = max_num,
eq = dat_used$eq,
stat = dat_used$stat,
X = cbind(dat_used$X_stat, dat_used$Y_stat),
sigma_rec = phi_ss_sim,
sigma_eq = tau_sim,
sigma_stat = phi_s2s_sim,
omega = 0.75,
mu_log_ell = mu_log_ell,
sigma_log_ell = sigma_log_ell,
ind_eq = ind_na_eq,
len_eq = len_na_eq
)
Synthetic data are simulated with Stan.
model <- 'gmm_sim_spatcorr_l'
file <- file.path(dir_stan, sprintf("%s.stan", model))
mod <- cmdstan_model(file, include_paths = c(file.path(dir_stan, 'stan_include')))
mod$print()## /*
## * Model to simulate total residuals
## */
##
## #include functions.stan
##
## data {
## int N;
## int NEQ;
## int NSTAT;
## int M;
##
## array[N] int<lower=1,upper=NEQ> eq;
## array[N] int<lower=1,upper=NSTAT> stat;
##
## array[N] vector[2] X;
##
## array[NEQ, M] int ind_eq;
## array[NEQ] int len_eq;
##
## real<lower=0> sigma_rec;
## real<lower=0> sigma_eq;
## real<lower=0> sigma_stat;
## real mu_log_ell;
## real<lower=0> sigma_log_ell;
## real<lower=0,upper=1> omega; // variance weights for ph_SS
## }
##
## transformed data {
## real theta = sqrt(omega .* square(sigma_rec));
## real theta2 = sqrt((1 - omega) .* square(sigma_rec));
## }
##
## parameters {
##
## }
##
## model {
## }
##
## generated quantities {
## array[NEQ] real eqterm = normal_rng(rep_vector(0, NEQ), sigma_eq);
## array[NEQ] real ell = exp(normal_rng(rep_vector(mu_log_ell, NEQ), sigma_log_ell));
## array[NSTAT] real statterm = normal_rng(rep_vector(0, NSTAT), sigma_stat);;
## vector[N] Y_sim;
##
## {
## vector[N] mu = to_vector(eqterm)[eq] + to_vector(statterm)[stat];
##
## for(i in 1:NEQ) {
## matrix[len_eq[i], len_eq[i]] L;
## L = calc_L2_M1(X[ind_eq[i, 1:len_eq[i]]], len_eq[i], theta, ell[i], theta2);
##
## Y_sim[ind_eq[i, 1:len_eq[i]]] = multi_normal_cholesky_rng(mu[ind_eq[i, 1:len_eq[i]]], L);
## }
##
## }
## }
fit_sim <- mod$sample(
data = data_list_sim,
seed = 5618,
chains = 1,
iter_sampling = 1,
iter_warmup = 0,
fixed_param = TRUE
)## Running MCMC with 1 chain...
##
## Chain 1 Iteration: 1 / 1 [100%] (Sampling)
## Chain 1 finished in 0.1 seconds.
draws_sim <- fit_sim$draws()
y_sim <- as.vector(subset(draws_sim, variable=c('Y_sim'), regex=TRUE))
deltaB_sim <- as.vector(subset(draws_sim, variable=c('eqterm'), regex=TRUE))
deltaS_sim <- as.vector(subset(draws_sim, variable=c('statterm'), regex=TRUE))
ell_sim <- as.vector(subset(draws_sim, variable=c('ell'), regex=TRUE))
Now, the model parameters are estimated, using the full model, i.e. event terms, station terms, all hyperparameters, and length scales. The Stan code is simple: just move the parameter declarations to the parameters block and the sampling statements to the model block, and add some priors. I ran the model in a different session, so I'm just reading in the results now (it takes quite long to run).
data_list_sim$Y <- y_sim
model <- 'gmm_partition_spatcorr_l'
file <- file.path(dir_stan, sprintf("%s.stan", model))
mod <- cmdstan_model(file, include_paths = c(file.path(dir_stan, 'stan_include')))
# fit_l <- mod$sample(
# data = data_list_sim,
# seed = 5618,
# chains = 4,
# iter_sampling = n_sample,
# iter_warmup = n_wp,
# refresh = 10,
# max_treedepth = 10,
# adapt_delta = 0.8,
# parallel_chains = 2
# )
# fit_l$cmdstan_diagnose()
# fit_l$save_object(file = file.path(dir_stan, sprintf("fit_%s_%s.RDS", sim_name, model)))
fit_l <- readRDS(file.path(dir_stan, sprintf("fit_%s_%s.RDS", sim_name, model)))
fit_l$time()## $total
## [1] 9999.454
##
## $chains
## chain_id warmup sampling total
## 1 1 3284.50 1559.03 4843.53
## 2 2 3165.42 1547.90 4713.33
## 3 3 3149.80 1936.78 5086.57
## 4 4 3389.73 1759.37 5149.10
Results look pretty good. Hyperparameters are pretty well estimated.
draws_l <- fit_l$draws()
summarise_draws(subset(draws_l, variable=c('omega', 'sigma', 'mu_log_ell', 'theta'), regex=TRUE))## # A tibble: 8 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 omega 0.750 0.749 0.0353 0.0362 0.692 0.808 1.00 593. 1127.
## 2 sigma_rec 0.500 0.500 0.00523 0.00518 0.491 0.509 1.00 1448. 1496.
## 3 sigma_eq 0.411 0.410 0.0223 0.0220 0.375 0.449 1.00 3262. 1642.
## 4 sigma_stat 0.386 0.386 0.0113 0.0111 0.367 0.404 1.00 1393. 1284.
## 5 sigma_log_ell 0.516 0.514 0.0863 0.0859 0.379 0.662 1.03 122. 223.
## 6 mu_log_ell 2.15 2.15 0.104 0.109 1.98 2.32 1.00 441. 900.
## 7 theta 0.433 0.433 0.0114 0.0115 0.414 0.451 1.00 861. 1363.
## 8 theta2 0.250 0.250 0.0179 0.0181 0.219 0.278 1.00 549. 1137.
Now, I partition the data into event terms, station terms, and residuals \(\delta WS\), using INLA (I also tried lmer, and the results are pretty much the same). Then, one can run a spatial correlation model on just the residuals. The standard deviations match the theoretical values quite well.
dat_used$Y_sim <- y_sim
# priors
prior_prec_tau_lg <- list(prec = list(prior = "loggamma", param = c(2, 0.5)))
prior_prec_phiS2S_lg <- list(prec = list(prior = "loggamma", param = c(2, 0.5)))
prior_prec_phiSS_lg <- list(prec = list(prior = "loggamma", param = c(2, 0.5)))
prior.fixed <- list(mean.intercept = 0, prec.intercept = 5)
###### partition into event terms/station terms
fit_inla_sim <- inla(Y_sim ~ 1 + f(eq, model = "iid", hyper = prior_prec_tau_lg)
+ f(stat, model = "iid",hyper = prior_prec_phiS2S_lg),
data = dat_used,
family="gaussian",
control.fixed = prior.fixed,
control.family = list(hyper = list(prec = prior_prec_phiSS_lg)),
control.compute = list(dic = TRUE, waic = TRUE)
)
print(1/sqrt(fit_inla_sim$summary.hyperpar$mean))## [1] 0.4893895 0.4314864 0.3832127
dat_used$deltaWS_sim <- y_sim - fit_inla_sim$summary.fitted.values$mean
print(c(mean(dat_used$deltaWS_sim), sd(dat_used$deltaWS_sim)))## [1] 6.483488e-06 4.567840e-01
Below is a histogram of the standard deviations of the fitted values. Since the residuals are just the observations (constant values) minus the fitted values, these should also be the standard deviations of the within-event/within-station residuals. Most are between 0.1 and 0.15, which is not small for ground-motion data.
df_plot <- data.frame(sd_dws = fit_inla_sim$summary.fitted.values$sd)
ggplot(df_plot) +
geom_histogram(mapping = aes(sd_dws))## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
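That the sample standard deviation of \(\delta WS\) comes out below the true \(\phi_{SS} = 0.5\) is expected. A toy base-R sketch (made-up numbers, event terms only, separate from the analysis) shows the mechanism: removing estimated per-event terms absorbs part of the within-event variance into the point estimates.

```r
# toy sketch: residuals after removing per-event means understate phi
set.seed(42)
phi_toy <- 0.5
n_ev_toy <- 200; n_rec_toy <- 5
eq_toy <- rep(1:n_ev_toy, each = n_rec_toy)
dB_toy <- rnorm(n_ev_toy, 0, 0.4)    # event terms
y_toy  <- dB_toy[eq_toy] + rnorm(n_ev_toy * n_rec_toy, 0, phi_toy)

dB_hat_toy <- tapply(y_toy, eq_toy, mean)  # point-estimate event terms
dW_toy <- y_toy - dB_hat_toy[eq_toy]       # residuals after removing them

c(sd_dW = sd(dW_toy), phi_true = phi_toy)
# E[sd(dW)] is roughly phi * sqrt(1 - 1/n_rec), here about 0.45
```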
Now, the spatial correlation models are estimated (using Stan) with the within-event/within-station residuals as target variable. The Stan code is similar to before, just getting rid of the event/station terms and associated parameters.
data_list_sim$Y <- dat_used$deltaWS_sim
model <- 'gmm_spatcorr_l'
file <- file.path(dir_stan, sprintf("%s.stan", model))
mod <- cmdstan_model(file, include_paths = c(file.path(dir_stan, 'stan_include')))
# fit_dws_l <- mod$sample(
# data = data_list_sim,
# seed = 5618,
# chains = 4,
# iter_sampling = n_sample,
# iter_warmup = n_wp,
# refresh = 10,
# max_treedepth = 10,
# adapt_delta = 0.8,
# parallel_chains = 2
# )
# fit_dws_l$cmdstan_diagnose()
# fit_dws_l$save_object(file = file.path(dir_stan, sprintf("fit_%s_%s.RDS", sim_name, model)))
fit_dws_l <- readRDS(file.path(dir_stan, sprintf("fit_%s_%s.RDS", sim_name, model)))
fit_dws_l$time()## $total
## [1] 9542.262
##
## $chains
## chain_id warmup sampling total
## 1 1 4483.79 378.343 4862.13
## 2 2 4349.74 345.786 4695.53
## 3 3 3994.38 254.165 4248.55
## 4 4 4461.45 214.078 4675.53
Results are not as good as before. In particular, the length scale is underestimated.
draws_dws_l <- fit_dws_l$draws()
summarise_draws(subset(draws_dws_l, variable=c('omega', 'sigma', 'mu_log_ell', 'theta'), regex=TRUE))## # A tibble: 6 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 omega 0.716 0.716 0.0330 0.0348 0.663 0.768 1.01 323. 464.
## 2 sigma_rec 0.461 0.461 0.00416 0.00416 0.454 0.467 1.01 1094. 732.
## 3 sigma_log_ell 0.460 0.454 0.0777 0.0795 0.344 0.599 1.03 59.3 188.
## 4 mu_log_ell 1.87 1.87 0.0950 0.0972 1.71 2.01 1.02 84.7 186.
## 5 theta 0.389 0.390 0.0101 0.0103 0.373 0.406 1.00 561. 687.
## 6 theta2 0.245 0.245 0.0141 0.0145 0.222 0.268 1.01 287. 397.
Now, estimation is done with traditional geostatistics, using a variogram. Variograms are calculated and fitted with the package gstat. For each event, a variogram is calculated, and the exponential model is fitted. I'm just using the default settings of variogram, which gives some warnings when fitting for multiple events (in particular for small numbers of records).
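For reference, the classical (Matheron) estimator that empirical variograms are based on, \(\hat{\gamma}(h) = \frac{1}{2 N(h)} \sum_{N(h)} (z_i - z_j)^2\), can be sketched in base R (my own binning choices, not gstat's defaults):

```r
# classical semivariance estimate in distance bins (illustrative only)
semivariogram <- function(z, X, breaks) {
  h <- as.matrix(dist(X))                 # pairwise distances
  g <- outer(z, z, function(a, b) 0.5 * (a - b)^2)
  iu <- upper.tri(h)                      # count each pair once
  bin <- cut(h[iu], breaks = breaks)
  data.frame(dist  = tapply(h[iu], bin, mean),
             gamma = tapply(g[iu], bin, mean),
             np    = as.vector(table(bin)))
}

set.seed(1)
X <- cbind(runif(50, 0, 40), runif(50, 0, 40))
z <- rnorm(50)       # pure nugget: gamma should be flat, around var(z)
sv <- semivariogram(z, X, breaks = seq(0, 60, by = 10))
sv
```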
res_variogram <- matrix(nrow = n_eq_used, ncol = 5)
i <- 1
for(i in 1:n_eq_used) {
dat_eq <- dat_used[dat_used$eq == i,]
coordinates(dat_eq) = ~X_stat + Y_stat
v_lin <- variogram(deltaWS_sim ~ 1, dat_eq)
fit_v_lin <- fit.variogram(v_lin, model = vgm("Exp"))
res_variogram[i,] <- c(i, fit_v_lin$psill, fit_v_lin$range)
}
Below are histograms of the estimated (logarithmic) length scales and nugget values.
p <- ggplot(as.data.frame(res_variogram)) +
geom_histogram(mapping = aes(V5)) +
scale_x_log10() +
labs(x="estimated length scales", y="#")
print(p)## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
p <- ggplot(as.data.frame(res_variogram)) +
geom_histogram(mapping = aes(V2)) +
scale_x_log10() +
labs(x="estimated nugget values", y="#")
print(p)## Warning: Transformation introduced infinite values in continuous x-axis
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 96 rows containing non-finite values (stat_bin).
Now some plots of variograms of well-recorded events (more than 80 records).
omega_sim <- 0.75
theta1_sim_sq <- omega_sim * phi_ss_sim^2
theta2_sim_sq <- (1-omega_sim) * phi_ss_sim^2
eq80 <- as.numeric(names(which(table(dat_used$eq) > 80)))
for(i in eq80) {
dat_eq <- dat_used[dat_used$eq == i,]
coordinates(dat_eq) = ~X_stat + Y_stat
v_lin <- variogram(deltaWS_sim ~ 1, dat_eq)
fit_v_lin <- fit.variogram(v_lin, model = vgm("Exp"))
preds <- variogramLine(fit_v_lin, maxdist = max(v_lin$dist))
preds$gamma_sim <- theta2_sim_sq + theta1_sim_sq * (1 - exp(-preds[,1] / ell_sim[i]))
if(abs(log(ell_sim[i]) - log(fit_v_lin$range[2])) < 0.5) {
col = 'black'
} else {
col = 'blue'
}
p <- ggplot() +
geom_point(data = v_lin, aes(x = dist, y = gamma, size = np)) +
geom_line(data = preds, aes(x = dist, y = gamma), color = col) +
geom_line(data = preds, aes(x = dist, y = gamma_sim), color = 'red') +
labs(title = paste('event ',i), x = 'h' , y = 'variogram')
print(p)
}## Warning in fit.variogram(v_lin, model = vgm("Exp")): No convergence after 200
## iterations: try different initial values?
Here are some plots. Nothing fancy, just plotting estimated values against true values. I should probably add error bars, but this does it for now.
tmp <- as_draws_matrix(subset(draws_l, variable=c('^log_ell'), regex=TRUE))
tmp2 <- as_draws_matrix(subset(draws_dws_l, variable=c('^log_ell'), regex=TRUE))
df_plot <- data.frame(ell_sim = ell_sim,
ell_est = exp(colMeans(tmp))[1:n_eq_used],
ell_est_dws = exp(colMeans(tmp2))[1:n_eq_used],
ell_vario = res_variogram[,5],
n_eq = as.numeric(table(eq)),
dell_est = colMeans(tmp)[1:n_eq_used] - log(ell_sim),
dell_est_dws = colMeans(tmp2)[1:n_eq_used] - log(ell_sim),
dell_vario = log(res_variogram[,5]) - log(ell_sim),
log_ell_est_sd = colSds(tmp)[1:n_eq_used]
)
p <- ggplot() +
geom_point(data = df_plot, aes(x = ell_sim, y = ell_est)) +
geom_point(data = df_plot, aes(x = ell_sim, y = ell_est_dws), color = 'red') +
geom_point(data = df_plot, aes(x = ell_sim, y = ell_vario), color = 'orange') +
scale_y_log10() + scale_x_log10() +
labs(x="true length scale", y="estimated length scale")
print(p)
p <- ggplot() +
geom_point(data = df_plot, aes(x = n_eq, y = dell_est)) +
geom_point(data = df_plot, aes(x = n_eq, y = dell_est_dws), color = 'red') +
geom_point(data = df_plot, aes(x = n_eq, y = dell_vario), color = 'orange') +
geom_errorbar(data = df_plot, aes(x = n_eq, ymin = dell_est - log_ell_est_sd,
ymax = dell_est + log_ell_est_sd)) +
scale_x_log10() +
labs(x="number of records", y="residual log length scale")
print(p)
p <- ggplot() +
geom_point(data = df_plot, aes(x = n_eq, y = log_ell_est_sd)) +
scale_x_log10() +
labs(x="number of records", y="sd of length scale")
print(p)
The length scales are quite well estimated with the correct model, but are underestimated (with Stan) when using just the residuals. The variogram values are all over the place, but are ok for well-recorded events (\(N_{rec} > 80\)). One can probably get some sort of visual quality assurance from the variograms. I'm not too familiar with variogram estimation, and I guess it's nice to get some warnings, but it seems there are a lot of little tweaks one can do, and a bad fit might be hard to detect. On the other hand, similar things apply to the GP estimation, as the model based on the within-event residuals converges well, but gives biased length scales. I'm a bit surprised that the variograms give reasonable estimates of the length scales (at least for well-recorded events), even though the uncertainty in the target variable (\(\delta WS\)) is ignored. For the GP estimation, the hierarchical nature (the event-specific length scales are basically random effects) means that the length scales are somewhat constrained. Not sure if one can do that as well with the variograms – I guess one could calculate variograms, and then fit the same model with some random effects structure to it.
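One hedged sketch of the random-effects idea for variograms (entirely my own construction, not something I have tried on the real data): treat the per-event log ranges as noisy estimates drawn around a common mean, and shrink them toward the pooled mean with precision weights (empirical Bayes, here with the between-event spread assumed known).

```r
# toy sketch: precision-weighted shrinkage of noisy per-event log ranges
set.seed(7)
mu0 <- log(10); sig0 <- 0.5
n_ev <- 50
log_ell_true <- rnorm(n_ev, mu0, sig0)

# pretend each variogram fit returns log(range) with its own standard
# error, larger for poorly recorded events (made-up values)
se <- runif(n_ev, 0.2, 1.5)
log_ell_hat <- rnorm(n_ev, log_ell_true, se)

# shrink toward the pooled mean, weighting by relative precision
w <- sig0^2 / (sig0^2 + se^2)
log_ell_shrunk <- w * log_ell_hat + (1 - w) * mean(log_ell_hat)

rmse_raw    <- sqrt(mean((log_ell_hat    - log_ell_true)^2))
rmse_shrunk <- sqrt(mean((log_ell_shrunk - log_ell_true)^2))
c(rmse_raw = rmse_raw, rmse_shrunk = rmse_shrunk)
```

The shrunken estimates should on average be closer to the truth than the raw per-event fits, which is essentially what the hierarchical GP model does internally.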
Here, some data is generated according to the following model (same as before, but with the same length scale for all events) \[ \delta B \sim N(0, \tau) \\ \delta S \sim N(0, \phi_{S2S}) \\ Y \sim GP(\delta B + \delta S, k(\vec{x}_s,\vec{x}_s')) \\ k(\vec{x}_s,\vec{x}_s') = \delta_{ij,eq} \left(\theta_1^2 \exp \left[- \frac{|\vec{x}_s - \vec{x}_s'|}{\ell}\right] + \theta_2^2 \delta_{ij,rec}\right) \] Below are the hyperparameters (length scale of 10 km).
sim_name <- "sim"
phi_ss_sim <- 0.5
tau_sim <- 0.4
phi_s2s_sim <- 0.4
n_wp <- 500
n_sample <- 200
data_list_sim <- list(N = n_rec_used,
NEQ = n_eq_used,
NSTAT = n_stat_used,
M = max_num,
eq = dat_used$eq,
stat = dat_used$stat,
X = cbind(dat_used$X_stat, dat_used$Y_stat),
sigma_rec = phi_ss_sim,
sigma_eq = tau_sim,
sigma_stat = phi_s2s_sim,
omega = 0.75,
ell = 10,
ind_eq = ind_na_eq,
len_eq = len_na_eq
)
Synthetic data are simulated with Stan.
model <- 'gmm_sim_spatcorr'
file <- file.path(dir_stan, sprintf("%s.stan", model))
mod <- cmdstan_model(file, include_paths = c(file.path(dir_stan, 'stan_include')))
mod$print()## /*
## * Model to simulate total residuals
## */
##
## #include functions.stan
##
## data {
## int N;
## int NEQ;
## int NSTAT;
## int M;
##
## array[N] int<lower=1,upper=NEQ> eq;
## array[N] int<lower=1,upper=NSTAT> stat;
##
## array[N] vector[2] X;
##
## array[NEQ, M] int ind_eq;
## array[NEQ] int len_eq;
##
## real<lower=0> sigma_rec;
## real<lower=0> sigma_eq;
## real<lower=0> sigma_stat;
## real<lower=0> ell; // length scales
## real<lower=0,upper=1> omega; // variance weights for ph_SS
## }
##
## transformed data {
## real theta = sqrt(omega .* square(sigma_rec));
## real theta2 = sqrt((1 - omega) .* square(sigma_rec));
## }
##
## parameters {
##
## }
##
## model {
## }
##
## generated quantities {
## array[NEQ] real eqterm = normal_rng(rep_vector(0, NEQ), sigma_eq);
## array[NSTAT] real statterm = normal_rng(rep_vector(0, NSTAT), sigma_stat);;
## vector[N] Y_sim;
##
## {
## vector[N] mu = to_vector(eqterm)[eq] + to_vector(statterm)[stat];
##
## for(i in 1:NEQ) {
## matrix[len_eq[i], len_eq[i]] L;
## L = calc_L2_M1(X[ind_eq[i, 1:len_eq[i]]], len_eq[i], theta, ell, theta2);
##
## Y_sim[ind_eq[i, 1:len_eq[i]]] = multi_normal_cholesky_rng(mu[ind_eq[i, 1:len_eq[i]]], L);
## }
##
## }
## }
fit_sim <- mod$sample(
data = data_list_sim,
seed = 5618,
chains = 1,
iter_sampling = 1,
iter_warmup = 0,
fixed_param = TRUE
)## Running MCMC with 1 chain...
##
## Chain 1 Iteration: 1 / 1 [100%] (Sampling)
## Chain 1 finished in 0.0 seconds.
draws_sim <- fit_sim$draws()
y_sim <- as.vector(subset(draws_sim, variable=c('Y_sim'), regex=TRUE))
deltaB_sim <- as.vector(subset(draws_sim, variable=c('eqterm'), regex=TRUE))
deltaS_sim <- as.vector(subset(draws_sim, variable=c('statterm'), regex=TRUE))
Now, the model parameters are estimated, using the full model, i.e. event terms, station terms, all hyperparameters, and length scales. The Stan code is simple: just move the parameter declarations to the parameters block and the sampling statements to the model block, and add some priors. I ran the model in a different session, so I'm just reading in the results now (it takes quite long to run).
data_list_sim$Y <- y_sim
model <- 'gmm_partition_spatcorr'
file <- file.path(dir_stan, sprintf("%s.stan", model))
mod <- cmdstan_model(file, include_paths = c(file.path(dir_stan, 'stan_include')))
# fit_l <- mod$sample(
# data = data_list_sim,
# seed = 5618,
# chains = 4,
# iter_sampling = n_sample,
# iter_warmup = n_wp,
# refresh = 10,
# max_treedepth = 10,
# adapt_delta = 0.8,
# parallel_chains = 2
# )
# fit_l$cmdstan_diagnose()
# fit_l$save_object(file = file.path(dir_stan, sprintf("fit_%s_%s.RDS", sim_name, model)))
fit_l <- readRDS(file.path(dir_stan, sprintf("fit_%s_%s.RDS", sim_name, model)))
fit_l$time()## $total
## [1] 7054.487
##
## $chains
## chain_id warmup sampling total
## 1 1 2677.73 510.926 3188.66
## 2 2 2637.70 513.020 3150.72
## 3 3 3358.46 540.556 3899.02
## 4 4 3019.68 453.479 3473.16
Results look pretty good. Hyperparameters are pretty well estimated.
draws_l <- fit_l$draws()
summarise_draws(subset(draws_l, variable=c('omega', 'sigma', 'ell','theta'), regex=TRUE))## # A tibble: 7 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 omega 0.766 0.766 0.0281 0.0280 0.719 0.813 1.00 686. 552.
## 2 sigma_rec 0.503 0.503 0.00526 0.00483 0.494 0.512 1.01 643. 712.
## 3 sigma_eq 0.411 0.409 0.0227 0.0230 0.374 0.449 1.01 1424. 676.
## 4 sigma_stat 0.391 0.392 0.0113 0.0114 0.373 0.410 0.998 708. 626.
## 5 ell 9.72 9.69 0.635 0.661 8.80 10.9 1.00 697. 770.
## 6 theta 0.440 0.440 0.00991 0.00950 0.424 0.456 1.01 833. 575.
## 7 theta2 0.243 0.243 0.0146 0.0139 0.219 0.266 1.00 626. 579.
Now, the model parameters are estimated, using the full model, i.e. event terms, station terms, all hyperparameters, length scales. Different length scales per event are estimated, but not as random effects.
data_list_sim$Y <- y_sim
model <- 'gmm_partition_spatcorr_l2'
file <- file.path(dir_stan, sprintf("%s.stan", model))
mod <- cmdstan_model(file, include_paths = c(file.path(dir_stan, 'stan_include')))
# fit_l2 <- mod$sample(
# data = data_list_sim,
# seed = 5618,
# chains = 4,
# iter_sampling = n_sample,
# iter_warmup = n_wp,
# refresh = 10,
# max_treedepth = 10,
# adapt_delta = 0.8,
# parallel_chains = 2
# )
# fit_l2$cmdstan_diagnose()
# fit_l2$save_object(file = file.path(dir_stan, sprintf("fit_%s_%s.RDS", sim_name, model)))
fit_l2 <- readRDS(file.path(dir_stan, sprintf("fit_%s_%s.RDS", sim_name, model)))
fit_l2$time()## $total
## [1] 12355.1
##
## $chains
## chain_id warmup sampling total
## 1 1 5194.38 496.940 5691.32
## 2 2 5030.48 498.111 5528.59
## 3 3 5774.90 757.611 6532.51
## 4 4 6067.06 592.382 6659.45
Results look ok. Hyperparameters are pretty well estimated.
draws_l2 <- fit_l2$draws()
summarise_draws(subset(draws_l2, variable=c('omega', 'sigma','theta'), regex=TRUE))## # A tibble: 6 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 omega 0.815 0.816 0.0265 0.0246 0.771 0.859 1.00 566. 579.
## 2 sigma_rec 0.497 0.497 0.00519 0.00525 0.488 0.506 1.01 975. 521.
## 3 sigma_eq 0.414 0.414 0.0218 0.0227 0.380 0.450 0.999 1685. 747.
## 4 sigma_stat 0.392 0.392 0.0111 0.0111 0.373 0.409 1.00 652. 608.
## 5 theta 0.449 0.449 0.00936 0.00921 0.434 0.464 1.00 965. 713.
## 6 theta2 0.213 0.213 0.0152 0.0140 0.187 0.237 1.00 506. 627.
Now, I partition the data into event terms, station terms, and residuals \(\delta WS\), using INLA (I also tried lmer, and the results are pretty much the same). Then, one can run a spatial correlation model on just the residuals. The standard deviations match the theoretical values quite well.
dat_used$Y_sim <- y_sim
# priors
prior_prec_tau_lg <- list(prec = list(prior = "loggamma", param = c(2, 0.5)))
prior_prec_phiS2S_lg <- list(prec = list(prior = "loggamma", param = c(2, 0.5)))
prior_prec_phiSS_lg <- list(prec = list(prior = "loggamma", param = c(2, 0.5)))
prior.fixed <- list(mean.intercept = 0, prec.intercept = 5)
###### partition into event terms/station terms
fit_inla_sim <- inla(Y_sim ~ 1 + f(eq, model = "iid", hyper = prior_prec_tau_lg)
+ f(stat, model = "iid",hyper = prior_prec_phiS2S_lg),
data = dat_used,
family="gaussian",
control.fixed = prior.fixed,
control.family = list(hyper = list(prec = prior_prec_phiSS_lg)),
control.compute = list(dic = TRUE, waic = TRUE)
)
print(1/sqrt(fit_inla_sim$summary.hyperpar$mean))## [1] 0.4902325 0.4299815 0.3871449
dat_used$deltaWS_sim <- y_sim - fit_inla_sim$summary.fitted.values$mean
print(c(mean(dat_used$deltaWS_sim), sd(dat_used$deltaWS_sim)))## [1] 7.519280e-06 4.574384e-01
Below is a histogram of the standard deviations of the fitted values. Since the residuals are just the observations (constant values) minus the fitted values, these should also be the standard deviations of the within-event/within-station residuals. Most are between 0.1 and 0.15, which is not small for ground-motion data.
df_plot <- data.frame(sd_dws = fit_inla_sim$summary.fitted.values$sd)
ggplot(df_plot) +
geom_histogram(mapping = aes(sd_dws))## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Now, the spatial correlation models are estimated (using Stan) with the within-event/within-station residuals as target variable. The Stan code is similar to before, just getting rid of the event/station terms and associated parameters.
data_list_sim$Y <- dat_used$deltaWS_sim
model <- 'gmm_spatcorr'
file <- file.path(dir_stan, sprintf("%s.stan", model))
mod <- cmdstan_model(file, include_paths = c(file.path(dir_stan, 'stan_include')))
# fit_dws_l <- mod$sample(
# data = data_list_sim,
# seed = 5618,
# chains = 4,
# iter_sampling = n_sample,
# iter_warmup = n_wp,
# refresh = 10,
# max_treedepth = 10,
# adapt_delta = 0.8,
# parallel_chains = 2
# )
# fit_dws_l$cmdstan_diagnose()
# fit_dws_l$save_object(file = file.path(dir_stan, sprintf("fit_%s_%s.RDS", sim_name, model)))
fit_dws_l <- readRDS(file.path(dir_stan, sprintf("fit_%s_%s.RDS", sim_name, model)))
fit_dws_l$time()## $total
## [1] 1705.789
##
## $chains
## chain_id warmup sampling total
## 1 1 676.821 183.193 860.014
## 2 2 633.775 172.077 805.852
## 3 3 643.694 250.662 894.356
## 4 4 570.488 269.626 840.114
Results are not as good as before. In particular, the length scale is underestimated.
draws_dws_l <- fit_dws_l$draws()
summarise_draws(subset(draws_dws_l, variable=c('omega', 'sigma', 'ell', 'theta'), regex=TRUE))## # A tibble: 5 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 omega 0.727 0.727 0.0279 0.0296 0.682 0.773 1.00 599. 520.
## 2 sigma_rec 0.464 0.464 0.00436 0.00441 0.457 0.471 1.01 684. 549.
## 3 ell 7.85 7.81 0.516 0.490 7.06 8.70 1.01 574. 459.
## 4 theta 0.395 0.395 0.00900 0.00923 0.381 0.410 1.00 665. 528.
## 5 theta2 0.242 0.242 0.0122 0.0130 0.222 0.261 1.00 590. 578.
Next, estimation is done with traditional geostatistics using a variogram, same as before.
res_variogram <- matrix(nrow = n_eq_used, ncol = 5)
i <- 1
for(i in 1:n_eq_used) {
dat_eq <- dat_used[dat_used$eq == i,]
coordinates(dat_eq) = ~X_stat + Y_stat
v_lin <- variogram(deltaWS_sim ~ 1, dat_eq)
fit_v_lin <- fit.variogram(v_lin, model = vgm("Exp"))
res_variogram[i,] <- c(i, fit_v_lin$psill, fit_v_lin$range)
}
Below are histograms of the estimated (logarithmic) length scales and nugget values. log(10) is about 2.3, so the estimated length scales cluster around that value.
p <- ggplot(as.data.frame(res_variogram)) +
geom_histogram(mapping = aes(V5)) +
scale_x_log10() +
labs(x="estimated length scales", y="#")
print(p)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
p <- ggplot(as.data.frame(res_variogram)) +
geom_histogram(mapping = aes(V2)) +
scale_x_log10() +
labs(x="estimated nugget values", y="#")
print(p)
## Warning: Transformation introduced infinite values in continuous x-axis
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 109 rows containing non-finite values (stat_bin).
tmp = as_draws_df(subset(draws_l2, variable=c('^ell'), regex=TRUE))
df_plot <- data.frame(ell_vario = res_variogram[,5],
n_eq = as.numeric(table(eq)),
dell_vario = log(res_variogram[,5]) - log(10),
ell_est_l2 = colMeans(tmp)[1:n_eq_used],
dell_est_l2 = log(colMeans(tmp)[1:n_eq_used]) - log(10)
)
p <- ggplot() +
geom_point(data = df_plot, aes(x = n_eq, y = ell_est_l2), color = 'black') +
geom_point(data = df_plot, aes(x = n_eq, y = ell_vario), color = 'orange') +
scale_x_log10() + scale_y_log10() +
labs(x="number of records", y=" length scale")
print(p)
p <- ggplot(df_plot) +
geom_density(mapping = aes(ell_vario),color = 'orange') +
geom_density(mapping = aes(ell_est_l2)) +
scale_x_log10() +
labs(x="estimated length scale", y="#")
print(p)
Now some plots of variograms of well-recorded events (more than 80 records).
omega_sim <- 0.75
theta1_sim_sq <- omega_sim * phi_ss_sim^2
theta2_sim_sq <- (1-omega_sim) * phi_ss_sim^2
ell_sim <- 10
eq80 <- as.numeric(names(which(table(dat_used$eq) > 80)))
for(i in eq80) {
dat_eq <- dat_used[dat_used$eq == i,]
coordinates(dat_eq) = ~X_stat + Y_stat
v_lin <- variogram(deltaWS_sim ~ 1, dat_eq)
fit_v_lin <- fit.variogram(v_lin, model = vgm("Exp"))
preds <- variogramLine(fit_v_lin, maxdist = max(v_lin$dist))
preds$gamma_sim <- theta2_sim_sq + theta1_sim_sq * (1 - exp(-preds[,1] / ell_sim))
if(abs(log(ell_sim) - log(fit_v_lin$range[2])) < 0.5) {
col = 'black'
} else {
col = 'blue'
}
p <- ggplot() +
geom_point(data = v_lin, aes(x = dist, y = gamma, size = np)) +
geom_line(data = preds, aes(x = dist, y = gamma), color = col) +
geom_line(data = preds, aes(x = dist, y = gamma_sim), color = 'red') +
labs(title = paste('event ',i), x = 'h' , y = 'variogram')
print(p)
}
## Warning in fit.variogram(v_lin, model = vgm("Exp")): No convergence after 200
## iterations: try different initial values?
## Warning in fit.variogram(v_lin, model = vgm("Exp")): No convergence after 200
## iterations: try different initial values?
As before, the full Stan model can recover the parameters quite well, but when estimating on just \(\delta WS\), the length scale is underestimated. The variogram estimates are all over the place, which might be expected, but for well-recorded events things look ok (I am again a bit surprised by this). The nugget values are quite bad, though. When estimating a different length scale for each event with a Bayesian GP model (non-hierarchical), most length scales are close to 10km; the prior probably helps, though it is quite wide. For that model I also assume the same variance for all events, which probably helps too. The random-effects model for the length scales had problems converging, probably because sigma_log_ell is actually zero, so we get some issues there.
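When a group-level standard deviation like sigma_log_ell is close to zero, a non-centered parameterization often samples better than the centered one, since it avoids the funnel geometry. Below is a hedged Stan fragment of what that could look like for event-specific length scales; the parameter names are my assumptions and the data and likelihood blocks are omitted, so this is a sketch rather than the model actually used here.

```stan
// Hypothetical non-centered parameterization for event-specific
// length scales; can sample better when sigma_log_ell is near zero.
parameters {
  real mu_log_ell;               // mean log length scale
  real<lower=0> sigma_log_ell;   // between-event sd (may be ~0)
  vector[NEQ] z_ell;             // standardized deviates
}
transformed parameters {
  // ell_eq = exp(mu + sigma * z): as sigma_log_ell -> 0,
  // all events collapse to a common length scale
  vector[NEQ] ell_eq = exp(mu_log_ell + sigma_log_ell * z_ell);
}
model {
  z_ell ~ std_normal();
  // ... likelihood as before, using ell_eq[e] for event e
}
```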
University of California, Los Angeles, kuehn@ucla.edu↩︎